A Tutorial on Thompson Sampling

Russo et al. (2018)

Nadine Daum

2025-11-25

Problem Setting

  • Agent repeatedly chooses actions and receives stochastic rewards

  • Must balance exploitation (choose current best) & exploration (learn about uncertain options)

  • Bernoulli bandit as running example: arm k yields reward ∈ {0,1} with unknown success probability θₖ

  • Goal: maximize cumulative reward (minimize regret)

  • Regret = loss from not always selecting the optimal arm

  • Link to Robbins (1952): early exploration–exploitation dilemma
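The setting above can be sketched in a few lines of Python. This is an illustrative toy, not code from the tutorial; the arm probabilities match the deck's later example θ = (0.9, 0.8, 0.7).

```python
import random

class BernoulliBandit:
    """K-armed Bernoulli bandit: arm k pays 1 with probability theta_k."""

    def __init__(self, probs):
        self.probs = probs          # unknown to the agent
        self.best = max(probs)      # used only to measure regret

    def pull(self, k):
        # Stochastic reward in {0, 1}
        return 1 if random.random() < self.probs[k] else 0

    def regret(self, k):
        # Per-period regret: gap between the optimal arm's mean and arm k's
        return self.best - self.probs[k]

bandit = BernoulliBandit([0.9, 0.8, 0.7])
r = bandit.pull(0)        # stochastic reward in {0, 1}
gap = bandit.regret(2)    # expected per-period loss of arm 3: about 0.2
```

Cumulative regret over a horizon is just the sum of these per-period gaps, which is what "minimize regret" refers to above.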

Exploration Problem

Illustration of uncertainty over action rewards via posterior distributions:

  • Action 1 fairly certain
  • Action 2 fairly certain
  • Action 3 very uncertain

Action 3's wide posterior means it could still be the best arm → motivates exploration

Motivation

Why Greedy Fails

  • Selects arm with highest current estimate → gets stuck if early samples are unlucky

  • May never discover the best arm, even with infinite time

  • Russo’s 3-arm example: greedy keeps exploiting arm 1 forever


Why ε-greedy is still not ideal

  • Exploration is random, not targeted → wastes samples

  • Treats all uncertainty equally (uninformative exploration)

  • Performs poorly in large action spaces


Goal: exploration driven by uncertainty, not randomness.

Thompson Sampling

Why TS is appealing

  • Randomized, but driven by uncertainty (not noise)

  • Automatically balances exploration vs. exploitation

  • Higher posterior uncertainty → higher probability of being selected

  • Intuition: “Act as if one plausible world is true, then learn if you’re wrong.”


Pseudocode

  1. Maintain a posterior belief over θ

  2. Sample a parameter θ̃ ~ p(θ | data)

  3. Choose the action optimal under θ̃

  4. Observe reward & update the posterior

  5. Repeat
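The five steps above can be sketched for the Beta-Bernoulli case. A minimal illustration with uniform Beta(1,1) priors, not the tutorial's reference implementation:

```python
import random

def thompson_sampling(true_probs, horizon=1000, seed=0):
    """Beta-Bernoulli Thompson Sampling with uniform Beta(1, 1) priors."""
    rng = random.Random(seed)
    K = len(true_probs)
    # 1. Posterior over theta_k is Beta(alpha_k, beta_k)
    alpha = [1] * K   # 1 + observed successes
    beta = [1] * K    # 1 + observed failures
    total_reward = 0
    for _ in range(horizon):
        # 2. Sample a plausible parameter from each arm's posterior
        samples = [rng.betavariate(alpha[k], beta[k]) for k in range(K)]
        # 3. Act as if the sampled parameters were true
        k = max(range(K), key=lambda i: samples[i])
        # 4. Observe the reward and update that arm's posterior
        r = 1 if rng.random() < true_probs[k] else 0
        alpha[k] += r
        beta[k] += 1 - r
        total_reward += r
    return alpha, beta, total_reward
```

The conjugate Beta-Bernoulli pair keeps step 4 to two integer increments; in richer models the posterior update is the expensive part.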

Algorithms: Greedy vs Thompson Sampling

  • Greedy uses mean estimate θ̂ₖ = αₖ / (αₖ + βₖ) → picks current best guess → no exploration

  • TS samples θ̃ₖ ~ Beta(αₖ, βₖ) → stochastic exploration

  • Posterior updates are identical → the only difference is selecting by a posterior sample instead of the posterior mean
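To make the one-line difference concrete, a sketch of the two selection rules; the posterior counts in the usage example are made up for illustration:

```python
import random

def choose_greedy(alpha, beta):
    # Greedy ranks arms by the posterior mean alpha_k / (alpha_k + beta_k)
    means = [a / (a + b) for a, b in zip(alpha, beta)]
    return max(range(len(means)), key=lambda k: means[k])

def choose_ts(alpha, beta, rng=random):
    # TS ranks arms by one posterior draw theta_k ~ Beta(alpha_k, beta_k)
    draws = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
    return max(range(len(draws)), key=lambda k: draws[k])

# Arm 0 pulled 10 times (6 successes); arm 1 never pulled (uniform prior).
alpha, beta = [7, 1], [5, 1]
choose_greedy(alpha, beta)   # always arm 0: mean 7/12 beats the prior mean 1/2
choose_ts(alpha, beta)       # sometimes arm 1: its wide posterior can draw high
```

Greedy will never touch arm 1 again, while TS tries it with probability equal to the posterior probability that it is the better arm.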

Behavior: Greedy vs Thompson Sampling

  • Greedy commits early → gets stuck
  • TS explores proportionally to uncertainty
  • TS rapidly concentrates on optimal action
  • Clear divergence by ~200 periods

Regret: Greedy vs Thompson Sampling

  • TS: per-period regret drops toward 0 quickly
  • Greedy: per-period regret stays roughly constant (it never recovers)
  • TS performs well both for:
    • fixed θ = (0.9, 0.8, 0.7)
    • θ averaged over random draws
  • In short: greedy never learns, TS does
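A quick simulation of this comparison. This is a sketch, not the tutorial's experiment; the arms are ordered worst-first here so greedy's deterministic tie-breaking does not hand it the best arm on step one:

```python
import random

def cumulative_regret(policy, probs, horizon=1000, seed=0):
    """Cumulative regret of 'greedy' or 'ts' on fixed Bernoulli arms."""
    rng = random.Random(seed)
    K = len(probs)
    alpha, beta = [1] * K, [1] * K   # Beta(1, 1) priors for both policies
    regret = 0.0
    for _ in range(horizon):
        if policy == "greedy":
            # Exploit the posterior mean
            k = max(range(K), key=lambda i: alpha[i] / (alpha[i] + beta[i]))
        else:
            # Exploit one posterior sample
            k = max(range(K), key=lambda i: rng.betavariate(alpha[i], beta[i]))
        r = 1 if rng.random() < probs[k] else 0
        alpha[k] += r
        beta[k] += 1 - r
        regret += max(probs) - probs[k]
    return regret

probs = [0.7, 0.8, 0.9]
greedy = sum(cumulative_regret("greedy", probs, seed=s) for s in range(40)) / 40
ts = sum(cumulative_regret("ts", probs, seed=s) for s in range(40)) / 40
# Averaged over runs, TS's cumulative regret is typically far below greedy's:
# greedy often locks onto a suboptimal arm and pays a constant gap every period.
```

The only difference between the two branches is the `max` key, yet the regret curves diverge, matching the behavior described above.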

Why TS works

  • TS has a provably bounded information ratio (it trades off per-period regret against information gained)

  • Leads to regret bounds comparable to UCB

  • Simple, scalable, Bayesian

  • Works in many domains (online ads, recommendations, clinical trials)

  • Russo’s main result: logarithmic regret under broad conditions (no equations needed)

Open Questions & Discussion Points

  • Under what conditions does TS underperform & why?

  • How sensitive is TS to incorrect or uninformative priors?

  • How should TS be adapted for drifting or nonstationary rewards?

  • Can TS be extended reliably to contextual bandits & RL settings?

  • In what scenarios is UCB a better choice than TS?

Wrap-Up


“Explore when unsure, exploit when sure.”


  • Bandits capture the tradeoff: exploration vs. exploitation
  • Greedy strategies are simple but fail under uncertainty
  • Thompson Sampling = posterior sampling
  • TS naturally balances exploration and exploitation
  • Widely used in practice (recommendations, ads, online decision-making)